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Abstract — This paper describes a methodology for detecting 
anomalies from sequentially observed and potentially noisy 
data. The proposed approach consists of two main elements: (1) 
filtering, or assigning a belief or likelihood to each successive 
measurement based upon our ability to predict it from 
previous noisy observations, and (2) hedging, or flagging 
potential anomalies by comparing the current belief against 
a time-varying and data-adaptive threshold. The threshold is 
adjusted based on the available feedback from an end user. 
Our algorithms, which combine universal prediction with recent 
work on online convex programming, do not require computing 
posterior distributions given all current observations and involve 
simple primal-dual parameter updates. At the heart of the 
proposed approach lie exponential-family models which can 
be used in a wide variety of contexts and applications, and 
which yield methods that achieve sublinear per-round regret 
against both static and slowly varying product distributions with 
marginals drawn from the same exponential family. Moreover, 
the regret against static distributions coincides with the minimax 
value of the corresponding online strongly convex game. We 
also prove bounds on the number of mistakes made during the 
hedging step relative to the best offline choice of the threshold 
with access to all estimated beliefs and feedback signals. We 
validate the theory on synthetic data drawn from a time-varying 
distribution over binary vectors of high dimensionality, as well 
as on the Enron email dataset. 

Keywords: anomaly detection, exponential families, filtering, 
individual sequences, label-efficient prediction, minimax regret, 
online convex programming, prediction with limited feedback, 
sequential probability assignment, universal prediction. 



I. Introduction 

In this paper, we explore the performance of online anomaly 
detection methods built on sequential probability assignment 
and dynamic thresholding based on limited feedback. We 
assume we sequentially monitor the state of some system 
of interest. At each time step, we observe a possibly noise- 
corrupted version z t of the current state x t , and need to 
infer whether xt is anomalous relative to the actual sequence 
x*^ 1 = (x\, . . . , Xt-i) of the past states. This inference is 
encapsulated in a binary decision y t , which can be either —1 

This work was supported by NSF CAREER Award No. CCF-06-43947, 
NSF Award No. DMS-08-1 1062, and DARPA Grant No. HR00 11-07-1-003. 
Portions of this work were presented at the IEEE International Symposium 
on Information Theory, Seoul, Korea, June/July 2009. 

M. Raginsky, R. Willett, C. Horn, and J. Silva are with the Department of 
Electrical and Computer Engineering, Duke University, Durham, NC 27708 
USA (e-mail: m.raginsky@duke.edu, willett@duke.edu, ceh23@duke.edu, 
jg.silva@duke.edu). 

R. Marcia is with the School of Natural Sciences, University of California, 
Merced, CA 95343 USA (e-mail: rmarcia@ucmerced.edu). 



(non-anomalous or nominal behavior) or +1 (anomalous be- 
havior). After announcing our decision, we may occasionally 
receive feedback on the "true" state of affairs and use it to 
adjust the future behavior of the decision-making mechanism. 

Our inference engine should make good use of this feed- 
back, whenever it is available, to improve its future perfor- 
mance. One reasonable way to do it is as follows. Having 
observed z t_1 (but not z t ), we can use this observation to 
assign "beliefs" or "likelihoods" to the clean state x t . Let 
us denote this likelihood assignment as p^x^z 1 " 1 ). Then, if 
we actually had access to the clean observation x t , we could 
evaluate p t = p t (x t \z t ^ 1 ) and declare an anomaly (y t = +1) 
if p t < r t , where r t is some positive threshold; otherwise we 
would set y t = — 1 (no anomaly at time t). This approach is 
based on the intuitive idea that a new observation x t should 
be declared anomalous if it is very unlikely based on our 
past knowledge (namely, z t ~ 1 ). In other words, observations 
are considered anomalous if they are in a portion of the 
observation domain which has very low likelihood according 
to the best probability model that can be assigned to them on 
the basis of previously seen observations. (In fact, anomaly 
detection algorithms based on density level sets revolve around 
precisely this kind of reasoning.) The complication here, 
however, is that we do not actually observe x t , but rather 
its noise-corrupted version z t . Thus, we settle instead for an 
estimate p t of p t based on z t and compare this estimate against 
r t . If we receive feedback y t at time t and it differs from our 
label y t , then we adjust the threshold appropriately. 

A. Contributions 

There are several challenging aspects inherent in the prob- 
lem of sequential anomaly detection: 

• The observations cannot be assumed to be independent, 
identically distributed, or even come from a realization 
of a stochastic process. In particular, an adversary may 
be injecting false data into the sequence of observations 
to cripple our anomaly detection system. 

• Observations may be contaminated by noise or be ob- 
served through an imperfect communication channel. 

• Declaring observations anomalous if their likelihoods fall 
below some threshold is a popular and effective strategy 
for anomaly detection, but setting this threshold is a 
notoriously difficult problem. 

• Obtaining feedback on the quality of automated anomaly 
detection is costly as it generally involves considerable 
effort by a human expert or analyst. Thus, if we have 
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an option to request such feedback at any time step, we 
should exercise this option sparingly and keep the number 
of requests to a minimum. Alternatively, the times when 
we receive feedback may be completely arbitrary and not 
under our control at all — for instance, we may receive 
feedback only when we declare false positives or miss 
true anomalies. 

In this paper, we propose a general methodology for address- 
ing these challenges. With apologies to H.P. Lovecraft [1], 
we will call our proposed framework FHTAGN, or Filtering 
and Hedging for Time-varying Anomaly recoGNition. More 
specifically, the two components that make up FHTAGN are: 

• Filtering — the sequential process of updating beliefs on 
the next state of the system based on the noisy observed 
past. The term "filtering" comes from statistical signal 
processing [2] and is intended to signify the fact that the 
beliefs of interest concern the unobservable actual system 
state, yet can only be computed in a causal manner from 
its noise-corrupted observations. 

• Hedging — the sequential process of flagging potential 
anomalies by comparing the current belief against a time- 
varying threshold. The rationale for this approach comes 
from the intuition that a behavior we could not have 
predicted well based on the past is likely to be anomalous. 
The term "hedging" is meant to indicate the fact that the 
threshold is dynamically raised or lowered, depending on 
the type of the most recent mistake (a false positive or a 
missed anomaly) made by our inference engine. 

Rather than explicitly modeling the evolution of the system 
state and then designing methods for that model (e.g., using 
Bayesian updates [2], [3]), we adopt an "individual sequence" 
(or "universal prediction" [4]) perspective and strive to per- 
form provably well on any individual observation sequence 
in the sense that our per-round performance approaches that 
of the best offline method with access to the entire data 
sequence. This approach allows us to sidestep challenging 
statistical issues associated with dependent observations or 
dynamic and evolving probability distributions, and is robust 
to noisy observations. 

We make the following contributions: 

1 ) We cast both filtering and hedging as instances of Online 
Convex Programming (or OCP), as defined by Zinkevich [5]. 
This will permit us to implement both of these ingredients 
of FHTAGN using a powerful primal-dual method of Mirror 
Descent [6], [7] and quantify their performance in a unified 
manner via regret bounds relative to the best offline strategy 
with access to the full observation sequence. 

2) We show that the filtering step can be implemented as 
a sequential assignment of beliefs, or probabilities, to the 
system state based on the past noisy observations, where 
the probabilities are computed according to an exponential 
family model whose natural parameter is dynamically de- 
termined based on the past. We present a strategy based 
on mirror descent for sequentially assigning a time-varying 
product distribution with exponential-family marginals to the 
observed noisy sequence of system states and prove regret 
bounds relative to the best i.i.d. model as well as to the best 



sufficiently slowly changing model that can be assigned to 
the observation sequence in hindsight. These regret bounds 
improve and extend our preliminary results [8]; in addition to 
tightening the bounds presented in that work, we extend the 
results to more general settings in which data may be corrupted 
by noise. 

3) We show that the hedging step can be implemented 
as a sequential selection of the critical threshold, such that 
whenever the estimated belief for the current state falls below 
this threshold, we declare an anomaly. We develop methods 
to incorporate available feedback and establish regret-type 
bounds on the number of mistakes relative to the best threshold 
that can be selected in hindsight with access to the entire 
sequence of assigned beliefs and feedback. 

As described in the useful survey by Chandola et al. [9], 
several methods for anomaly detection have been developed 
using supervised [10], semi-supervised [11], and unsupervised 
[12] learning methods. In the online, individual-sequence 
setting we adopt, however, there is no intrinsic notion of 
what constitutes an anomaly. Instead, we focus on extrinsic 
anomalous behavior relative to the best model we can guess 
for the next observation based on what we have seen in the 
past. We are not aware of any anomaly detection performance 
bounds in non-stationary or adversarial settings prior to our 
work. 

B. Notation 

We will follow the following notational conventions. "Ba- 
sic" sets will be denoted by sans-serif uppercase letters, 
e.g., U,X, Z, while classes of sets and functions will be 
denoted by script letters, e.g., C,J-,G- Given a set X, we will 
denote by X fc the fc-fold Cartesian product of X with itself 
and by x a representative fc-tuple from X k . The set of all 
(one-sided) infinite sequences over X will be denoted by X°°, 
and a representative element will be written in boldface as 
x = (xi,X2, ■ ■ ■)■ The interior of a set U will be denoted 
by Int U. The standard Euclidean inner product between two 
vectors u,v e W n will be denoted by (u,v). 

II. Preliminaries 

This section is devoted to setting up the basic terminology 
and machinery to be used throughout the paper. This includes 
background information on online convex programming (Sec- 
tion II-A) and on exponential families (Section II-C). 

A. Online convex programming 

The philosophy advocated in the present paper is that the 
tasks of sequential probability assignment and threshold selec- 
tion can both be viewed as a game between two opponents, the 
Forecaster and the Environment. The Forecaster is continually 
predicting changes in a dynamic Environment, where the effect 
of the Environment is represented by an arbitrarily varying 
sequence of convex cost functions over a given feasible set, 
and the goal of the Forecaster is to pick the next feasible 
point in such a way as to keep the cumulative cost as low as 
possible. This is broadly formulated as the problem of online 
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convex programming, or OCP [5], [13], [14]. An OCP problem 
with horizon T is specified by a convex feasible set U C M. d 
and a family of convex functions T = {/ : U — > M}, and is 
described as follows: 

Algorithm 1 An abstract Online Convex Programming prob- 
lem 

The Forecaster picks an arbitrary initial point u\ G U 
fort = l,2,...,Tdo 

The Environment picks a convex function / f e J 
The Forecaster observes f t and incurs the cost ft(u~t) 
The Forecaster picks a new point u~t+i G U 
end for 



The total cost incurred by the Forecaster after T rounds is 
given by Ylt=i ( nere an d m tne sequel, hats denote 

quantities selected by the Forecaster on the basis of past 
observations). At each time t, the Forecaster's move u t must 
satisfy a causality constraint in that it may depend only on his 
past moves u t_1 and on the past functions / t_1 selected by 
the Environment. Thus, the behavior of the Forecaster may be 
described by a sequence of functions 



U. 



t=l,2,...,T 



so that u t = |U{(u t_1 , / t_1 ). We will refer to any such 
sequence fi T — {fi t : U* _1 x J 7 * -1 — ^ as a T- 

step strategy of the Forecaster. Informally, the goal of the 
Forecaster is to do almost as well as if he could observe 
the cost functions a U at once. For instance, we 

might want to minimize the difference between the actual 
cost incurred after T rounds of the game and the smallest 
cumulative cost that could be achieved in hindsight using a 
single feasible point. To that end, given a strategy /i T and a 
cost function tuple f T , let us define the regret w.r.t. a time- 
varying tuple u T = (ui, . . . , ut) G U t 



4=1 
T 



t=l 



= 2jto(fi*- 1 ,/ t - 1 



)) 



T 

E 

(=i 



Then the goal would be to select a suitable restricted subset 
Ct C U t and design /i T to ensure that the worst-case regret 

sup sup Rt(h T -J T ,u t ) 

fTfZjrT u T eCT 



T T 1 

E /«- ^ E /«w 

4=1 " T 4=1 J 



= sup 

is sublinear in T. (When u t — u for all t and for some u G U, 
we will write the regret as Rt(h t ; f T ,u), and the second sup 
in the worst-case regret would be over all u G U.) 

Note that it is often convenient to think of a comparison 
tuple u T G Ct as a strategy of the form fi t : U t_1 x J 7 * -1 — » 
{ut}, i.e., u t does not depend on the previous points or cost 
functions, but only on the time index t. This allows us to 
speak of comparison classes of strategies; however, to avoid 
confusion, we will always use the notation /i T (possibly with 



subscripts) to distinguish the Forecaster's observation-driven 
strategy from a comparison strategy u T , which may be time- 
varying but is always observation-independent. This is similar 
to the distinction made in control theory between closed- 
loop (or feedback) policies and open-loop policies [15]: a 
closed-loop policy is a sequence of functions for selecting 
the next control signal based on the past control signals and 
past observations, while an open-loop policy is a sequence of 
control signals fixed in advance. From this point of view, the 
regret pertains to the difference in cumulative costs between a 
feedback policy (/i T ) and the best open-loop policy in some 
reference class C. 

Remark 1 (Hannan consistency). More generally, we can 
consider unbounded-horizon strategies fi = {fi t ■ U* _1 x 
J 7 * -1 — > U}. Then, given a comparison class C C U°° of 
open-loop strategies, the design goal is to ensure that 

Rr(j*;C)± sup sup i? T (/i T ; / T , u T ) =o{T). (1) 

Any strategy fi that achieves (1) over a comparison class C 
is said to be Hannan-consistent w.r.t. T and C; see the text 
of Cesa-Bianchi and Lugosi [16] for a thorough discussion. 
One important comparison class is composed of all static (or 
constant) sequences, i.e., all u G U°° such that u\ = 112 = ■ ■ ■■ 
This class, which we will denote by C stat , is in one-to-one 
correspondence with the feasible set U, so 

Rt{m Cstat) = sup supi? T (/x T ;/ T ,u), 
and we will also denote this worst-case regret by Rt{h', U). 



B. The mirror descent procedure 

A generic procedure for constructing OCP strategies is 
inspired by the so-called method of mirror descent [6], [7], 
[17]. In the context of OCP, the rough idea behind mirror 
descent is as follows. At time t the Forecaster chooses the 
point 

u t+ i=argmin r) t (g t (u t ),u) + D(u,u t ) , (2) 
ueu L J 

where gt(u~t) is an arbitrary subgradient 1 of ft at tit, D(-, •) > 
is some measure of proximity between points in U, and r\ t > 
is a (possibly time-dependent) regularization parameter. The 
intuition behind (2) is to balance the tendency to stay close to 
the previous point against the tendency to move in the direction 
of the greatest local decrease of the cost. The key feature of 
mirror descent methods is that they can be flexibly adjusted 
to the geometry of the feasible set U through judicious choice 
of the proximity measure £)(•,•). In particular, when U is the 
canonical parameter space of an exponential family, a good 
proximity measure is the Kullback-Leibler divergence. The 
general measures of proximity used in mirror descent are given 
by the so-called Bregman divergences [19], [20]. Following 



'A subgradient of a convex function / : U 
any vector g £ W 1 , such that 

/(f) > f(u) + (g,v-u) 

holds for all v S IntU [18]. 



at a point u £ Int U is 
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[16], we introduce them through the notion of a Legendre 
function: 

Definition 1. Let U C W l be a nonempty set with convex 
interior. A function F : U — > K is called Legendre it is: 

1) Strictly convex and continuously differentiate through- 
out Int U; 

2) Steep (or essentially smooth,) — that is, if u\, 11%, . . . £ 
Int U is a sequence of points converging to a point on 
the boundary of U, then ||V-F(wj)|| — > oo as i — > oo, 
where \\ ■ \\ denotes any norm. 2 

The Bregman divergence induced by F is the nonnegative 
function Dp : U x Int U — > R, given by 

D F (u,v) = F(u)-F(v)-(VF(v),u-v),Vu £ U,v £ IntU. 

For example, if U = R d , then F(u) = (l/2)||w|| 2 , where 
|| ■ || is the Euclidean norm, is Legendre, and Dp(u,v) = 



(1/2)||< 



In general, for a fixed v £ Int U, Dp(-, v) gives 



the tail of the first-order Taylor expansion of F(-) around v. 
We will use the following properties of Bregman divergences 
(see [16] for proofs): 

1) For all u £ U and all v, w £ Int U, we have 

Dp(u, v) + Dp(v, w) — Dp(u, w) 

+ {VF(w)-VF(v),u-v). (3) 

2) Let S C U be a closed convex set. The Bregman 
projection of w £ Int U onto S is defined by 

n^ s( w ) = Bigmin Dp (u,w). 
ueS 

For any w £ Int U, IT^s^) exists, is unique, and sat- 
isfies the following generalized Pythagorean inequality: 
for all u £ S, w £ Int U, 

D F (u, w) > D F (u, ILvsM) + D F (n F , s (w), w). (4) 

We now present the general mirror descent scheme. We allow 
the possibility of restricting the feasible points to a closed, 
convex subset S of Int U. The mirror descent updates make 
use of F*, the Legendre-Fenchel dual of F [18], [21]: 

F*(z) = sup{(u,z) -F(u)}. 

Let U* denote the image of Int U under the gradient mapping 
VF: U* = VF(Int U). An important fact is that the gradient 
mappings VF and VF* are inverses of one another [7], [16], 
[17]: 



VF*(VF(u)) = u 
VF(VF(w)) = to 



Vu £ Int U.iiiE Int U* 



Following [16], we will refer to the points in IntU as the 
primal points and to their images under VF as the dual 
points. In the context of mirror descent schemes, the Legendre 
function F is referred to as the potential function. The mirror 
descent strategy for OCP is presented below as Algorithm 2. 
The name "mirror descent" reflects the fact that, at each 
iteration, the current point in the primal space is mapped to 

2 Since all norms on finite-dimensional spaces are equivalent, it suffices to 
establish essential smoothness in a particular norm, say the usual £2 norm. 



its "mirror image" in the dual space; this is followed by a step 
in the direction of the negative subgradient, and then the new 
dual point is mapped back to the primal space. 

Algorithm 2 A Generic Mirror Descent Strategy for OCP 
Require: A Legendre function F : U — > R; a decreasing 
sequence of strictly positive step sizes {r)t} 
The Forecaster picks an arbitrary initial point ui £ S 
for t = 1,2, ... do 

Observe the cost function ft £ F 

Incur the cost ft(ut) 

Compute £4 = VF(tt t ) 

Dual update: £ t+ i = & - r] t g t (u t ) 

Projected primal update: compute Ut+i — WF*(^ t +i) 

and 



u t+ i = n F>s (S t+ i) = argminL» F (tt,tt t+ i) 



end for 



In the case when U = R d and F(-) = (1/2)|| • || 2 , the 
above algorithm reduces to the standard projected subgradient 
scheme 



Ut+i =u t - i]t9t(u t ) 



u t +i 



argmm \ \u — Ut+\\ 



It can be also shown [7], [16] that, for a general Legendre 
potential F, the mirror descent strategy can be expressed as 



arg mm 

uGS 



Vt(gt(u t ),u} + D F (u,u t ) 



Note that pt+i is a function only of the most recent point- 
function pair (u t ,ft). The following lemma (see, e.g., Theo- 
rem 11.1 in [16]) is a key ingredient in bounding the regret 
of the mirror descent strategy: 

Lemma 1. For any u £ S and any t, we have the bound 

ft(u t ) < ft(u) + —(Dp(u,u t ) - D F (u,Ut+i) 

Vt v 

+ D F {u t ,u t+1 )y (5) 



Proof: Appendix A. ■ 

This basic bound, combined with additional assumptions on 

the functions in F, will allow us to construct Hannan- 
consistent OCP strategies. 

C. Background on exponential families 

This section is devoted to a brief summary of the basics of 
exponential families; Amari and Nagaoka [22] or Wainwright 
and Jordan [23] give more details. 

We assume that the observation space X is equipped with a 
(T-algebra B and a dominating <7-finite measure v on (X, B). 
Given a positive integer d, let 4> : X — > R d be a measurable 
function, and let cf>k, k= 1,2, ... ,d, denote its components: 

<(>(x) = (4>i{x), . . .,(f> d (x)) T . 
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Let be defined as the set of all 9 e R d such that 

exp {(0, (f>(x))}dv(x) < +oo. 



We then have the following definition: 

Definition 2. The set V((f>) of probability distributions on 
(X, B) parametrized by 9 £ 0, such that the probability 
density function of each Pg G V((j)) w.r.t. the measure v can 
be expressed as 



where 



pg(x)=e X p{(9,^x)}-^(9)}, 



$(<?)^log / exp{(0,0(a:)>}dj/(aO, 



is called an exponential family with sufficient statistic cj>. The 
parameter 9 G is called the natural parameter of V {<!>), and 
the set is called the natural parameter space. The function 
$ is called the log partition function. 3 

We will denote by Eg[] the expectation w.r.t. Pg: 

E e [g(X)] = f g(x)dP e (x) 



l{x)exp{{0,<t>(x))-9(0)}du(x). 



Example 1. The simplest example is the Bernoulli distribu- 
tion. In this case, X = {0, 1}, v is the counting measure, 
<p(x) = x, and $(0) = log[l + exp(0)]. The natural parameter 
space is the entire real line. Under this parametrization, we 
have p e (X = 1) = e e /(l + e e ). □ 

Example 2 (The Ising model). Consider an undirected graph 
G = (V, E) and associate with each vertex a G V a 
binary random variable X a G { — l.+l}. In this case, X = 
{ — l.+l} 17 , v is the counting measure, and we have the 
following density for the random variable X = (X a : a G 
V)e{-l,+l} v : 

p e (x) = exp < 9 a x a + ^ m \ , (6) 

[aeV (a.J3)eE J 

where 9 is the collection of d — \V\ + \E\ real parameters 
(9 a : a G V) and (9 a p : (a,/3) € E). The log partition 
function is 

$(9) = log cx p { 6aXa + H 



The sufficient statistic <ft is given by the functions <p a : x H> 
x a ,a € V and <j) a p : x i-> x a xp,(a, /3) € E. Since $(9) 
is finite for any choice of 9, we have = M. d , with the 
components of 9 appropriately ordered. □ 

Example 3 (Gaussian Markov random fields). Again, consider 
an undirected graph G = (V, E). For notational convenience, 
let us number the vertices as V = {l,...,p}. A Gaussian 
Markov random field (MRF) on G is a multivariate Gaussian 
random variable X — (Xi,...,X p ) T £ MP, where the 

3 This usage comes from statistical physics. 



covariates X a and Xp are independent if (a, f3) ^ E. Then 
X = R p and v is the Lebesgue measure. The distribution of 
X can be written exactly as in (6), the log partition function 
<£>(6>) is finite only if the p x p matrix T = [9 a p\ v a is 
negative definite (T -< 0), so that the parameter space is 
= {9 = {(3, T) e R p x Rp x p : T -< 0, T = T T }. ' □ 

D. General properties of exponential families 

The motivation behind our use of exponential families is 
twofold: (1) They form a sufficiently rich class of parametric 
statistical models which includes Markov random fields with 
pairwise interactions and can therefore be used to describe co- 
occurrence data, visual scene snapshots, biometric records, and 
many other categorical and numerical data types. Moreover, 
they can be used to approximate many nonparametric classes 
of probability densities [24]. (2) The negative log-likelihood 
function is convex in the natural parameter and linear in the 
sufficient statistic. This structure permits the use of OCR 
We will need the following facts about exponential families 
(proofs can be found in the references listed at the beginning 
of Section II-C): 

1) The log partition function <k(9) is lower semicontinuous 
on M. d and infinitely differentiable on 0. 

2) The derivatives of $ at 9 are the cumulants of the ran- 
dom vector 4>(X) = (<f>i(X), . . . , <fi(X)) when X ~ pg. 
In particular, 

V*(0) = (Eg(f>i(X), . . . , Eg(f) d (X)) T 

V 2 $(c?) = [Cov e (^(X),^(X))]J. =1 . 

Thus, the Hessian V 2< £ > (6'), being the covariance matrix 
of the vector 4>(X), is positive semidefinite, which 
implies that Q(9) is a convex function of 9. In particular, 
0, which, by definition, is the essential domain of $, is 
convex. 

3) $(6*) is steep (or essentially smooth): if {9 n } C is a 
sequence converging to some point 9 on the boundary 
of 0, then ||V$(0 n )|| -> +oo as n ->• oo. 

4) The mapping 9 n- V$(0) is invertible; the inverse 
mapping is p H> V<I>*(/i), where 



sup{(p,9) 
flee 



my 



is the Legendre-Fenchel dual of $. The gradient map- 
ping V$ maps the primal parameter 9 G to the 
corresponding dual parameter p E 0* = V3>(0). 
5) The relative entropy (Kullback-Leibler divergence) be- 
tween p 8l and pg 2 in V{4>), defined as D(pg 1 \\pg 2 ) = 
J x pg x \og(pg 1 /pg 2 )di', can be written as 

D{p Bl \\p e . 2 ) = $(0 2 ) - $(00 - (V$(0i), 9 2 - 9,) (7) 

From now on, we will use the shorthand £>(f?i||02)- 

From these properties, it follows that $ : — s- R is a Legendre 
function, and that the mapping Z)$ : x Int — s- R, defined 
by -D$(6*i, 02) = £>(02||0i), is a Bregman divergence. 
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III. Filtering: sequential probability assignment 

IN THE PRESENCE OF NOISE 

The first ingredient of FHTAGN is a strategy for assigning 
a likelihood (or belief) Pt(-|z t_1 ) to the clean symbol x t 
based on the past noisy observations z t_1 . Alternatively, we 
can think of the following problem: if xt is the actual clean 
symbol that has been generated at time t, then our likelihood 
p t = ptixtlz 1 ' 1 ), though well-defined, is not accessible for 
observation. Thus, we would like to estimate it via some 
estimator p t , which will depend on the actual observed noisy 
symbol z t , as well as on the previously obtained estimates 
P* -1 = (pi> • • • iPt-i)' m me field of signal processing, 
problems of this kind go under the general heading of filtering; 
this term refers to any situation in which it is desired, at 
each time t, to obtain an estimate of some clean unobservable 
quantity causally based on noisy past observations. 



A. Sequential probability assignment: general formulation 

1) Noiseless observations: Let us first consider the noise- 
less case, i.e., z t — x t for all t. Elements of an arbitrary 
sequence x = (xi, X2, ■ ■ •) G X°° are revealed to us one 
at a time, and we make no assumptions on the law that 
generates x. At each time t = 1,2,..., before xt is revealed, 
we have to assign a probability density p t (w.r.t. a fixed 
dominating measure v) to the possible values of xt- When 
Xt is revealed, we incur the logarithmic loss — logpi(xt). Let 
T> denote the set of all valid probability densities w.r.t. v. Then 
the sequential probability assignment can be represented by a 
sequence it of mappings ir t : X t_1 — > T>, so that 

p t = irtix*- 1 ), or pt(x t ) = [7r t (x t ~ 1 )](x t ). 

We refer to any such sequence of probability assignments tt 
as a prediction strategy. Since the probability assignment p t is 
a function of the past observations x t_1 , we may also view it 
as a conditional probability density p t (-|x t_1 ). In the absence 
of specific probabilistic assumptions on the generation of x, 
it is appropriate to view 



define the regret w.r.t. p = {pt} G C after T time steps as 



PMlx*- 1 ) ± / p t (xW- l )dv{x) 

J A 

as our belief, based on the past observations x' _1 , that the next 
observation x t will lie in a measurable set A C X. Another 
way to think about 7r is as a sequence of joint densities 

T 

p T (x T ) = l[pt(xt\x t - 1 ), T=l,2,.... 
t=i 

In an individual-sequence setting, the performance of a 
given prediction strategy is compared to the best performance 
achievable on x by any strategy in some specified comparison 
class C [4], [16]. Any such comparison strategy is also 
specified by a sequence of conditional densities p t (xi|x t_1 ) 
of Xt given x t_1 . Suppose first that the horizon T is fixed in 
advance. Given a prediction strategy 7r = {Ttt}t^i, we can 



R T {ir T ;x T ,p T ) 

T i 



1 



t=i 

T 



[n(x*-i)](x t ) 
1 

p t (x t |x*- 



t=l 

T 

E lo s 

{=1 



Pt(x 4 |x* *) 
1 



p t (x t |x 4 1 ) - 



(8) 



As before, the distinction between Tr t and p t is that the former 
is a mapping of X t_1 into the space of probability densities T>, 
while the latter is the image of x t_1 under Tr t . The worst-case 
regret of tv is defined as 

Rt{^C) 



SUpSUpi?T(7!" ]X ,p ) 

x P ec 

T 



sup 



E l0g ^ t (x t |x*-i) 



inf > log 



PtixtW 



It is clear that no T-step strategy 7r T = {irt}J = i can achieve 
regret smaller than the minimax regret 

R* T (C) = inf R T (ir;C), 

7T 

where the infimum is over all strategies. A fundamental result 
due to Shtarkov [25] says that i?^(C) is achieved by the 
normalized maximum likelihood estimator (NMLE) over C. 
The latter is computed from the joint density 



P 1 {x 1 ) 



T\ A 



SUP q6C <7 T (x T ) 



J x t sup qeC q T (£, T )dv(£ T ) 

SU PqgC U^lltjxtlx^ 1 ) 

fx* su Pt?ec nf=i gtm^d^ie) 



by letting 



,NMLE/„t-l 



(,■' 



p*- 1 (as*- 1 ) : 



t = l, 



where p s , 1 < s < T, denotes the marginal density of x s . 
However, practical use of the normalized MLE strategy is 
limited since it requires solving an optimization problem over 
C whose complexity increases with T. 

Mixture strategies provide a more easily computable alterna- 
tive: if the reference class C is parametrized, C = {p e : 8 G 0} 
with p e = {pe,t}t^i, then we can pick a prior probability mea- 
sure iDon0 and consider a mixture strategy = {tt^} 
induced by the joint densities 



via the posterior 

Aw) f t-i 



(x^^pti-^- 1 ) 



p t-i( x t-\y 



t = 1,2, 



t = 1.2, 



For instance, when the underlying observation space X is 
finite and the reference class C consists of all i.i.d. product 
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distributions on X°°, 
(KT) predictor [26] 



the well-known Krichevsky-Trofimov 



t-i\ 



N(a\x 



t-i\ 



1/2 



(f-l) + |X|/2 



a G X 



where N(a\x t ) is the number of times a occurs in x , is a 
mixture strategy induced by a Dirichlet prior on the probability 
simplex over X [4]. It can be shown that the regret of the KT 
predictor is 0(|X|logT) [26], [16]. 

The computational cost of updating the probability assign- 
ment using a mixture strategy is independent of T, However, 
as can be seen in the case of the KT predictor, the dependence 
of the regret on the cardinality of X still presents certain 
difficulties. For example, consider the case where X = {0, l} d 
for some large positive integer d. If we wish to bring the per- 
round regret T~ 1 i?x down to some given e > 0, we must 
have T/logT = Q(2 d /e). Moreover, when X = {0, l} d , the 
KT predictor will assign extremely small probabilities (on the 
order of l/2 d ) to all as yet unseen binary strings x G {0, l} d . 
This is undesirable in settings where prior knowledge about the 
"smoothness" of the relative frequencies of x is available. Of 
course, if the dimensionality k of the underlying parameter 
space is much lower than the cardinality of X, mixture 
strategies lead to 0(k log T) regret, which is minimax optimal 
[16]. This can be thought of as a generalization of the 
Minimum Description Length (MDL) principle [27], [4] to the 
online, individual-sequence setting. However, the predictive 
distributions ir t (x t ~ 1 ) generated by a mixture strategy will not, 
in general, lie in C, which is often a reasonable requirement. 
In addition, implementation of a mixture strategy requires 
obtaining the posterior ^^(x 1 " 1 ) = Pt(-|x t_1 ) at each time 
f, which can be computationally expensive. 

2) Noisy observations: We are interested here in a more 
difficult problem, namely sequential probability assignment in 
the presence of noise. That is, instead of observing the "clean" 
symbols x t G X, we receive "noisy" symbols z t £ Z (where Z 
is some other observation space). We assume that the noise is 
stochastic, memoryless and stationary. In other words, at each 
time t, the noisy observation z t is given by z t = N(x t ,r t ), 
where {r t } is an i.i.d. random sequence and N(-, •) is a fixed 
deterministic function. There are two key differences between 
this and the noiseless setting described earlier, namely: 

1) The prediction strategy now consists of mappings 7? t : 
Z* _1 — ► T>, where, at each time t, p t (-|z t_1 ) = 7r t (z t_1 ) 
is the conditional probability density we assign to the 
clean observation x t at time t given the past noisy 
observations z l ~ x . 

2) We cannot compute the true incurred log loss 
-logptixtlz 1 ' 1 ). 

We are interested in sequential prediction, via the beliefs 
7r t (z t_1 ) = pt^lz 1 ^ 1 ), of the next clean symbol xt given the 
past noisy observations We assume, as before, that the 

clean sequence x is an unknown individual sequence over X. 
Moreover, under our noise model the noisy observations {z t }t 
are conditionally independent of one another given x. 



B. Probability assignment in an exponential family via OCP- 
based filtering 

We will now show that if the comparison class C consists 
of product distributions lying in an exponential family with 
natural parameter 6 G M. d , then we can use OCP to design 
a scheme for sequential probability assignment from noisy 
data. The use of OCP is made possible by the fact that, in an 
exponential family, the (negative) log likelihood is a convex 
function of the natural parameter and a linear function of the 
sufficient statistic. 

Recall from Section II-C that a d-dimensional exponential 
family consists of probability densities of the form pe(x) = 
e (0,<t>(x))-<£(8) t where the parameter 9 lies in a convex subset 
of M. d . We will consider prediction strategies of the form 



9 t {z t - 1 )=p$(-), t = 1,2, 



(9) 



where 6 t is a function of the past noisy observations z* 1 . 
The log loss function in this particular case takes the form 



-\ogp t (x t ) = -{e t ,ct>{x t )) + ${e t ), 



x G X. 



Thus, the regret relative to any comparison strategy p e induced 
by a parameter sequence = {9t} G 0°° via p t (-\x t ~ 1 ) = 
Pe t {-) can be written as 

T 1 T 1 

R T (n T ;x T \pl) = V log — - V log — 

f^i P$S x t) j=t Pe t {xt) 



1 

[l{9 tl x t )~t(9 u x t )] , 



where we have defined the function 

l{6,x)±-(9,4>(x)) + *{9). 

Because the log partition function $ is convex, the function 
9 i y £(9,x) is convex for every fixed x E X. Therefore, if 
the observations were noiseless, i.e., z t = Xt for all t, then 
mirror descent could have been used to design an appropriate 
strategy of the form (9). As we will show next, this approach 
also works in the noisy case, provided an unbiased estimator 
of <fi(x) based on the noisy observation z is available. In fact, 
our results in the noisy setting contain the noiseless setting as 
a special case. 

Let us fix an exponential family V (</>). We will consider 
the comparison class consisting of product distributions, where 
each marginal belongs to a certain subset of V ((f)). Specifically, 
let A be a closed, convex subset of 0. We take C to consist 
of prediction strategies p , where = (61,62, ■■■) G A°° 
ranges over all infinite sequences over A, and each p e is of 
the product form pg 1 <E> pe 2 ® . . ., i.e., 



Pt,e(-\ xt =P9 t (0. 



x t-i e X *- 1 ,t= 12, 



(10) 



In other words, each prediction strategy in C corresponds to a 
time-varying product density whose marginals belong to {pg : 
6 G A}. From now on, we will use the term "strategy" to refer 
to an infinite sequence G A°°; the corresponding object p e 
will be implied. 
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Consider the noisy observation model z t = N(xt,r t ), 
where {r f } is the i.i.d. noise process and N(-,-) is a known 
deterministic function. We make the following assumption: 

Assumption 1. There exists a function h : Z — > M. d , such that 
¥*[h(z)\x\ = <j){x), where the expectation is taken w.r.t. the 
noise input r. In other words, h(z) is an unbiased estimator 
of the sufficient statistic (j)(x). 

Example 4. In Example 1, x e {0, l} d and (f)(x) = x, r e 
{0, l} d , where each component r(i) is a Bernoulli^) random 
variable independent of everything else (p < 1/2), and 



z(i) — x(i) © r(i), 



i = l, 



(11) 



In other words, every component of z is independently related 
to the corresponding component of x via a binary symmetric 
channel (BSC) with crossover probability p [28]. Then an 
unbiased estimator for 4>(x) = x would be given by h(z) = 
{hi(z), h d {z)) T , where 



hi(z) = 



-P 
l-2p ' 



i = 1, 



,d. 



Example 5. Consider the Ising model from Example 2 and 
suppose that each x a , a € V, is independently corrupted by a 
BSC with crossover probability p. Then an unbiased estimator 
for 4>{x) is given by 



h a {z) = 
h a p{z) = 



P 



1 - 2p' 

Z a - P Zfi 



a e V 



l-2p l-2p 



(a,/3)eE 



so that we have E[/i a (z)|x] = 4> a (x) = x a and E[/i Q ^(z)|x] = 

<t>ap{x) = X a Xf;. □ 

Example 6. Consider the Gaussian MRF from Example 3 and 
suppose that each x a , a € V, is independently corrupted by 
an additive white Gaussian noise (AWGN) channel with noise 
variance a 2 [28]. In other words, z — (z a : a € V), where, 
for each a € V, with r a ~ Normal (0, cr 2 ) 

independent of all other r^, ^9 7^ a. Then an unbiased estimator 
for 4>{x) is given by 

7 / \ . / \ I Z a — a = P air 

h a {z) = z a ,h a p{z) = < , a,peV 

so that E[/i Q (z)|x] = x a and E[/i a ^(z)|x] = x a xp. □ 
We can estimate £(9 t ,x t ) by the filtering loss 

£(d u zt) = -{e t ,h(z t )) + ^(dt). 

By virtue of our Assumption 1, £(8 t ,z t ) is an unbiased 
estimator of the true log loss £(8 t ,Xt)'- 



E 



x, 



-{9 t ,E[h(zt)\x t ]} + He t ) 
= -(0t,<l>(xt)) + ${6 t )=l(6 u x t ). 
This leads to the following prediction strategy: 



Algorithm 3 Sequential Probability Assignment via Noisy 

Mirror Descent 

Require: A closed, convex set A C 0; a decreasing sequence 
of strictly positive step sizes {?y t } 
Initialize with 6\ € A 
for t = 1,2, ... do 

Acquire new noisy observation z t 

Compute the filtering loss £ t {0 t ) = -(fit, K z t)) + 

Compute fi t = V$(0 t ) 

Dual update: compute p! t , j = Jit — r) t V£t(0t) 
Projected primal update: compute t+1 = V$*(^ +1 ) 
and 



7 t+i 



arg minD(0t + i| 



end for 



In other words (see the remark after Algorithm 2), we have 
the Noisy Mirror Descent strategy 7r, defined by 



ft =Pe t (') 
Ot+i 



(12a) 



arg mm 



,v#(flt)-M«t) 



-£>(0 t ||0) 



(12b) 



For the reader's convenience, Table I shows the correspon- 
dence between the objects used in Algorithm 3 and the generic 
mirror descent strategy, i.e., Algorithm 2. 

This approach has the following features: 

1) The geometry of exponential families leads to a natural 
choice of the Legendre potential and the corresponding 
Bregman divergence to be used in the mirror-descent 
updates, namely the log partition function $ and the 
Kullback-Leibler divergence D (• || • ) . 

2) The optimization at each time can be computed using 
only the current observation x t and the probability 
density p t estimated at the previous time; it is not 
necessary to keep all observations in memory to ensure 
strong performance. 

Azoury and Warmuth [29] proposed and analyzed an algorithm 
similar to (12) in the setting of online density estimation over 
an exponential family. However, they did not consider noisy 
observations and only proved regret bounds for a couple of 
specific exponential families. One of the contributions of the 
present paper is to demonstrate that minimax (logarithmic) 
regret bounds against static strategies can be obtained for a 
general exponential family, subject to mild restrictions on the 
feasible set AC0. This provides an answer to the question 
posed by Azoury and Warmuth about whether it is possible to 
attain logarithmic regret for a general exponential family. 

C. Regret bounds for OCP -based filter 

We will now establish the following bounds on the expected 
regret of Algorithm 3: 

1) If the comparison class C consists of static strategies 
9% =02 = ■ • ■ over A, then, under certain regularity 
conditions on A and with properly chosen step sizes 
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Generic MD 


Algorithm 3 


Convex set 


U 


e 


Cost functions 


ft 


£(-,*) = -<.,fc(*)>+*(.) 


Project onto 


S 


A 


Legendre potential 


F 


$ 


Bregman divergence 


JMv) 


D,(.,.) = J?(.||-) 



TABLE I 

Correspondence between the generic mirror descent (MD) and Algorithm 3. 



{rjt}, the expected regret of the strategy in (12) will be 
O(logT). 

2) Given a strategy 9 and a time horizon T, define the 
variation of 9 from t = 1 to t = T as 



and the difference 



V T (0) 



T 

= £1 



7 t+ni) 



(13) 



where || • || is taken to be the £2 norm for concreteness. 
If the comparison class C consists of all time-varying 
strategies 9 = {9 t } over A, then, under certain regularity 
conditions on A and with properly chosen step sizes 
{ijt}, the expected regret of the algorithm in (12) will 
be 0((V T (8) + 1)VT). 
The expectation in both cases is taken w.r.t. the noise process 
{ft}. Moreover, in the absence of noise (i.e., z t = Xt for all t), 
the above regret bounds will hold for all observation sequences 
x. 

We will bound the regret of Algorithm 3 in two steps. In the 
first step, we will obtain bounds on the regret computed using 
the filtering losses £(•, •) that hold for any realization of the 
noisy sequence z = {z t }t- In the second step, we will use a 
martingale argument along the lines of Weissman and Merhav 
[30] to show that the expected "true" regret is bounded by the 
expected filtering regret. 

1 ) Regret bounds for the filtering loss: We will consider 
time-varying strategies of the form (10), where the set A is 
restricted in the following way. Given a positive constant H > 
0, define the set 

Q H ^ {9 e A:V 2 4>(0) h2HI dxd }, 

where Idxd denotes the d x d identity matrix, and the matrix 
inequality A y B denotes the fact that A — B is positive 
semidefinite. Note that the Hessian V 2 $(#) is equal to 

J(9)^-E 6 [V 2 e \og Pe (X)}, 

which is the Fisher information matrix at [31], [22], [23]. 
Our assumption on A thus stipulates that the eigenvalues of 
the Fisher information matrix over A are bounded from below 
by 2H. 

For any strategy 9 = {9t]t, define the cumulative true and 
estimated losses 

T 



t=i 

Lo,t{z T ) = E 
i=l 



A 9 , T (x T ,z T )±Le, T ( x T )-Lo, T (z T ) 



T 

E 

(=i 



{6 t ,h{zt)-4>{xt))- 



When 9 is a static strategy corresponding to 6 G A, we 
will write Lg t T(x T ), Lq,t{z t ), and Ag y T(x T , z T ). We first 
establish a logarithmic regret bound against static strategies in 
A. The theorem below improves on our earlier result from [8]: 

Theorem 1 (Logarithmic regret against static strategies). Let 
A be any closed, convex subset of Oh, and let 9 = {9t} be 
the sequence of parameters in A computed from the noisy 
sequence z = {z t }t using the OCP procedure shown in 
Algorithm 3 with step sizes r) t = l/t. Then, for any 9 £ A, 
we have 



Le T {z T ) <L e , T (z T ) + D(9 1 \\9) 



+ ggj±Mg (log T + 1 ), 



(14) 



where 
K(z T ) 



- max \\h(z t )\\ and M = i max || V$(0)|| 



Proof: Appendix B. ■ 
With larger step sizes r\ t = it is possible to com- 

pete against time-varying strategies 9 — {9 t }t, provided the 
variation is sufficiently slow: 

Theorem 2 (Regret against time-varying strategies). Again, let 
A be any closed, convex subset ofQn- Let 9 be the sequence of 
parameters in A computed from the noisy sequence z = {z t }t 
using the OCP procedure shown in Algorithm 3 with step sizes 
rj t = Then, for any sequence 9 = {9 t }t over A, we have 



L d T {z T ) < L 0>T (z T ) + D(dx ||0i) + Dy/T + 1 

{K{z T ) + Mf 



+ 4MVTV T (9) 



H 



where K(z T ) and M are defined as in Theorem 1, D = 
max^e'gA D(9\\9'), and Vt{9) is defined in (13). 

Proof: Appendix C. ■ 

Remark 2. Regret bounds against a dynamically changing 
reference strategy have been derived in a variety of contexts, 
including prediction with expert advice with finitely many 
time-varying experts [32], sequential linear prediction [33], 
general online convex programming [5], and sequential uni- 
versal lossless source coding (which is equivalent to universal 
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prediction with log-loss) [34], [35]. It is useful to compare the 
result of Theorem 2 with some of these bounds. The results of 
Herbster and Warmuth [33] and Zinkevich [5] assume fixed 
and known horizon T. Furthermore, Herbster and Warmuth 
[33] assume that the loss function is of subquadratic type 
(see, e.g., Section 11.4 in [16]) and use a different scale- 
dependent notion of regret and a carefully adjusted time- 
independent step size. On the other hand, Zinkevich [5] uses 
a constant step size r\ and obtains a regret bound of the 
form 0(Vt /rj + Trf), where the constants implicit in the O(-) 
notation depend on the diameter of the feasible set and on the 
maximum norm of the (sub)gradient of the loss function. It is 
not hard to modify the proof of our Theorem 2 to obtain a 
similar regret bound in our case, including the constants. The 
results of Shamir and Merhav [35] (which extend the work 
of Willems [34]) deal with universal prediction of piecewise- 
constant memoryless sources under log-loss. They propose 
two types of algorithms — those with linearly growing per- 
round computational complexity, and those with fixed per- 
round complexity. For the first type, and with Vt = 0(1), they 
obtain O(logT) regret (which is optimal [36]); for the second 
type, again with Vt = 0(1), they develop two schemes, one of 
which attains O(logT) regret for certain sources with "large" 
jumps and 0(T log log Tj log T) in general, while the other 
always achieves 0(\JT log T) regret. Since our algorithms 
have fixed per-letter complexity, it is clear that we cannot 
achieve the optimal O(logT) regret; however, our O(VT) 
regret in the Vt = 0(1) case compares favorably against the 
second fixed-complexity algorithm of [35]. 

2) Bounds on the expected true regret: We now proceed 
to establish regret bounds on Lq i t(x t ). The bounds of the 
preceding section reflected how close our cumulative loss 
might be to that of a competing strategy on noisy data. We 
now show that our proposed strategy ensures that the expected 
cumulative loss on the unobserved clean data is close to that 
of competing strategies. First, we need the following lemma, 
which is similar to Lemma 1 in [30]: 

Lemma 2. Let r — {r t }t be the i.i.d. observation noise 
process. For each t, let lZ t denote the a-algebra generated 
by ri, . . . , r t . Let — {Ot}t>\ be a sequence of probability 
assignments, such that each 6 t = 6 t (z t ^ 1 ). Then, for any 
individual sequence x = {x t }, {Aejix 1 , z t ),lZ t } t is a 
martingale, and so we have ELe r(z T ) = ELqt(x t ) for 
each T. The expectation is conditional on the underlying clean 
sequence x. 

Proof: Appendix D. ■ 
This leads to regret bounds on the proposed OCP-based 
filter: 

Theorem 3. Consider the setting of Theorem 1. Then we have 



E 



J e,T 



(x T ) 



< inf 

0eA 



L e , T (x T ) + D{fii\\B) 
E[(K(z T ) + M) 2 \x T ' 



H 



KlogT + 1) 



(15) 



Likewise, in the setting of Theorem 2, we have 



E 



Lq t (x t ) 



< inf 
e 



Le.rix 1 ) + D(9 1 1| 9 X ) + 4MVTV T (0 



i + E ^ T )+ M)2|xT W-i), 



H 



(16) 



where the infimum is over all strategies over A. 



Proof: We will only prove (16); the proof of (15) is 
similar. For the sake of brevity, we will drop the conditioning 



on x and simply write E[-] instead of E[ 



Proceeding 



analogously to the proof of Theorem 4 in [30], we have 

EL dT (x T ) = EL dT (z T ) 

< E j inf L e , T (z T ) + £>(0i||0i) + 4:MVtVt(6) 

(2Vf-l)\ 



dVt + i 



E[K(z T ) + Mf 
H 



< inf 

9 



inf 

9 



EL e , T (z T ) + D&WO^ + 4MVtV t (9 
E[(K(z t ) + M) 2 i 



dVt + i 



H 



(2VT- 1) 



L 9 , T (x 1 ) + D(9 1 \\ei) + 4MVTV T (8) 



dVt + i 



E[(£T(z T ) + Mf 



H 



I(2VT-1), 



where the first step follows from Lemma 2, the second from 
Theorem 2, the third from the fact that Einf[-] < inf E[-], and 
the last from Lemma 2 and the fact that the expectation is 
taken with respect to the distribution of z t \x t . ■ 

Remark 3. In the usual regret notation, the bounds of Theo- 
rem 3 can be written as follows: 

E[(K(z T ) +M) 2 \x T] 



ERtItt 1 -^ 1 
and 



< D 



H 



L (logT+l) 



ER t (tt t ; x t , 9 T ) < D(VT +1 + 1) 
E[(K(z T ) -j 



AMVfV T (d) 



M)V 



H 



(2VT-1), 



where we have overestimated the terms D(9i\ 
D(0i||0i) by D. 



and 



3) Minimax optimality and Hannan consistency: Finally, 
we make a few comments regarding minimax optimality and 
Hannan consistency of the strategies described in this section. 

Recall the OCP game described in Section II-A. During each 
round t = 1,2, ... ,T, the Forecaster plays a point u t E U, 
the Environment responds with a convex function f t £ F, and 
the Forecaster incurs the cost ftju t ). The Forecaster's goal is 
to keep the cumulative cost 2~2t=i ft(^k) as l° w as possible. 
Let us suppose, moreover, that the Environment is antagonistic 
in that it tries to choose the functions / t , so that the current 
cumulative cost 5Z a=1 f s (u s ) is as high as possible, given the 
past moves of the Forecaster, u l , and the past cost functions 
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/* . To allow the Environment more freedom, we assume 
that the cost function at time t are selected from a set Tt 
which may depend on the current move u t . With this in mind, 
let us define, following Abernethy et al. [14], the minimax 
value of the game as 

i?J(U, J- T ) = inf sup ... inf sup 

(f>M- inf (17) 

U=i t=i ) 

In words, i?^(U, J 77 ") is the worst-case regret of an optimal 
strategy for the Forecaster. The order of the infima and the 
suprema in (17) reflects the order of the moves and the 
causality restrictions. Thus, the Forecaster's move at time t 
may depend only on his moves and on the cost functions 
revealed at times s = 1, .... £ — 1; the Environment's cost 
function at time t may depend only on the Forecaster's moves 
at times s = l,,..,t and on the cost functions at times 
s = 1, . . . ,t — 1. Then we have the following bounds: 

Theorem 4 ([14]). Suppose that U C M d is compact and 
convex, and there exist some constants G, a > such that at 
each time t the functions in T t satisfy the following conditions: 

7 2 . 



•Ft = {/ : ||V/(ut)|| < G, V 2 / t vldxd) 



Then 



f- log(T + 1) < i$(U, F T ) < ^(logT + 1). 
la la 

We can particularize this result to our case. Consider the 

setting of Theorem 1 in the noiseless regime: x t — z t , Vi. Let 

us fix a constant K > and let U = A and 

Fi = ... = F T = {fx = t(;x) 

= -{.,<t>(x))+9(-):x€X,U(x)\\<2K}. 

Then, by hypothesis, each f x £ T t satisfies 

VJW)\\ < \\<K*)\\ + l|V*(0)|| < 2(K + M). 

Moreover, each f x £ Ft satisfies Vf x (9) = V$(0) >z 
2HItixd at every 6 £ A. Thus, applying Theorem 4 with 
G = 2{K + M) and a = 2H, we get 

(K + Mf 



H 



■log(T+l)<R* T (A,T T ) 
(K + M) 2 



< 



H 



(logT+1). 



On the other hand, with these assumptions we have K(z T ) = 
K(x T ) — K, and so our regret bound from Theorem 1 is of 
the form 

(K + M) 2 



R T <D 



H 



(logT + 1), 



where D = maxg^'gA D{6\\9'). Thus, up to an additive 
constant D, we attain the minimax value R^(A, J- T ). 

We can also establish a weak form of Hannan consistency 
for the OCP filters of Theorems 1 and 2. Let Q denote the set 
of all sequences x £ X°°, such that 

E[(K(z T ) + M) 2 \x T ] =o{T/\ogT). (18) 



For example, if h and <\> have uniformly bounded norms, then 
Q = X°°. Then the filter of Theorem 1 satisfies 



lim sup — sup E Rt (tt 1 ; a 



= 



for any x £ Q. As for the filter of Theorem 2, consider any 
set C of all time-varying strategies 9 £ A°°, such that 



supV T (0) 
eec 



Then 



1 

T eec 



lim sup — sup E Rt(tt t '; x T , 



= 



(19) 



for any x £ Q. For example, consider the set C of all piecewise 
constant strategies 6 £ A°°, such that kx, the maximum 
number of switches in any 9 T+1 , satisfies kx — o{yT). 
Assume also that A is compact. Then for any 6 £ C we have 



supVr(0) = supY^ |ji 
eec eec ^ 

< diam(A) 

= o(VT), 



7 t+i 



(20) 



so we have Hannan consistency. 

Stronger bounds that hold uniformly over the choice of 
the observation sequence x are also possible, provided the 
convergence in (18) is uniform; in other words, if we have a 
set Q C X°°, such that 

E[(K(z T ) + M) 2 \x T } logT 



lim sup sup 

T-hx xeG 



0. 



then the convergence in (19) and (20) is uniform in x £ G- 
IV. Hedging: sequential threshold selection for 

ANOMALY DETECTION 

In the preceding section we have shown how to perform 
filtering, i.e., how to assign a belief p t = pt^z*) to the clean 
symbol xt, such that 



log 



1 



Pt(z l 



1 

Ptixtlx*- 1 ) ' 



where Pt(-\-), t = 1, . . . , T, is the optimal sequence of condi- 
tional probability assignments under a (possibly time-varying) 
exponential family model, for the entire clean observation 
sequence x T . 

The second ingredient of FHTAGN is hedging, i.e., se- 
quential adjustment of the threshold r t , such that whenever 
Ct — C(Pt) < T t, where £ : R + — >• M is a user-specified 
monotonically increasing function, we flag z t as anomalous. 

Remark 4. Note that this formulation is equivalent to sequen- 
tially setting a threshold r f such that, whenever p t < Tt, we 
flag z t as anomalous. Using the monotone transformation £ 
allows us to sidestep challenging numerical issues when p t is 
very small. We will elaborate on this point later. 

In order to choose an appropriate r t , we rely on feedback 
from an end-user. Specifically, let the end user set the label 
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y t as 1 if zt is anomalous and —1 if Zt is not anomalous. 
However, since it is often desirable to minimize human in- 
tervention and analysis of each observation, we seek to limit 
the amount of feedback received. To this end, two possible 
scenarios could be considered: 

• At each time t, the Forecaster randomly decides whether 
to request a label from the end user. A label is requested 
with probability that may depend on f t and r*. 

• At each time t, the end-user arbitrarily chooses whether 
to provide a label to the Forecaster; the Forecaster has 
no control over whether or not it receives a label. 

As we will see, the advantage of the first approach is that it 
allows us to bound the average performance over all possible 
choices of times at which labels are received, resulting in 
stronger bounds. The advantage of the second approach is 
that is may be more practical or convenient in many settings. 
For instance, if an anomaly is by chance noticed by the end 
user or if an event flagged by the Forecaster as anomalous is, 
upon further investigation, determined to be non-anomalous, 
this information is readily available and can easily be provided 
to the Forecaster. In the sequel, we will develop performance 
bounds for both of these regimes. 

In both settings, we will be interested in the number of 
mistakes made by the Forecaster over T time steps. At each 
time step t, let y t denote the binary label output by the 
Forecaster, y t = sgn(r t — £ t ), where we define sgn(a) = —1 
if a < and +1 if a > 0. The number of mistakes over T 
time steps is given by 

T T 
f=l t=l 

For simplicity, we assume here that the time horizon T is 
known in advance. We would like to obtain regret bounds 
relative to any fixed threshold r g [T m i n ,T max ], where r m ; n 
and r max are some user-defined minimum and maximum 
levels, that could be chosen in hindsight after having observed 
the entire sequence of ((^-transformed) probability assignments 
{Ct}?=i an d feedback {yt}t=i (note that some y t 's may be 
"empty," reflecting the lack of availability of feedback at the 
corresponding times). Ideally, we would like to bound 

T T 
X, 1 {sgn(T t -C*)^»} _ cr inf 1 zJ 1 { 8 K n (' r -C*)^*}- (22) 

TG T m in,T ma x 

t=l 1 J t=l 

However, analyzing this expression is difficult owing to the 
fact that the function r n- l{ S gn(r-c)^y} * s not convex m T - 
To deal with this difficulty, we will use the standard technique 
of replacing the comparator loss with a convex surrogate 
function. A frequently used surrogate is the hinge loss 

t( s >y) - (i - s v)+, 

where (a) + = max{0, a}. Indeed, for any /, r and y we have 

l{sgn(T-C)7^} ^ 1 {(r-Qv<0} < i 1 ~ ( T ~ Ov) + - 

Thus, instead of (22), we will bound the "regret" 

T T 

^m 4 E 1 {6«-Ew- (23) 
t=i t=i 



where lt{ T ) is shorthand for £(t — CuVt)- In the following, 
we show that it is possible to obtain 0(yf) surrogate regret 
using a modified mirror descent (more precisely, projected 
subgradient descent) strategy. The modifications are necessary 
to incorporate feedback into the updates. 

A. Anomaly detection with full feedback 

In order to obtain bounds on the surrogate regret (23), we 
first analyze the ideal situation in which the Forecaster always 
receives feedback. Let n( ) denote the projection onto the 
interval [r min ,r max ]: 

11(a) = argmin (r — a) 2 . 

In this setting, the following simple algorithm, which is 
essentially the perceptron algorithm (see, e.g., Chapter 12 of 
[16]) with projections, does the job: 



Algorithm 4 Anomaly detection with full feedback 
Parameters: real numbers r\ > 0, r min < r max 
Initialize: t\ = r min 
for t = 1,2,...,T do 

Receive the estimated likelihood p t , set £ t = C(Pt) 

if ( t < i~ t then flag z t as an anomaly: y t = 1 else let 

m = -l 

Obtain feedback y t 
Let Tt+i = n(r t + -qy t l { y t ^ Vt} ) 
end for 



Intuitively, the idea is this: if the Forecaster correctly 
assigns the label y t to z t , then the threshold stays the same; 
if the Forecaster incorrectly labels a nominal observation 
(yt = —1) as anomalous (y t — 1), then the threshold is 
lowered: r t+ i w r t — 77; if the Forecaster incorrectly labels 
an anomalous observation (yt = 1) as nominal (y t — — 1), 
then the threshold is raised: r t+ i ss r f + 77. We also observe 
that the above algorithm is of a mirror descent type with the 
Legendre potential F(u) = u 2 /2, with one crucial difference: 
the current threshold r t is updated only when the Forecaster 
makes a mistake. We obtain the following regret bound: 

Theorem 5. Fix a time horizon T and consider the Forecaster 
acting according to Algorithm 4 with parameter r/ = l/y/T. 
Then, for any r € [T m ; n , T max ], we have 

T T 

R t(t) = ]T l { y t ^ yt} - ^(r) < Vf. (24) 
t=i t=i 

Proof: Appendix E. ■ 

B. Random, Forecaster-driven feedback times 

We can now address the problem of online anomaly detec- 
tion when the Forecaster has an option to query the end-user 
for feedback when the Forecaster is not sufficiently confident 
about its own decision [37]. Consider the following label- 
efficient Forecaster for anomaly detection using sequential 
probability assignments: 
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Algorithm 5 Label-efficient anomaly detection 
Parameters: real numbers i] > 0, r min < r max 
Initialize: t\ = r min 
for t = 1,2,... do 

Receive the estimated likelihood p t , set £ t = C(Pt) 

if (t < r t then flag z t as an anomaly: y t = 1 else let 

& = -i 

Draw a Bernoulli random variable Ut such that Pr[{/ t = 
1IC/'- 1 ] = 1/(1 + |Ct-r t |) 

if Ut = 1 then request feedback y t and let r t+ i = 
n (r t + T]yd{y t ^ yt }) else let T i+ i = r t 
end for 



This algorithm is, essentially, the label-efficient perceptron 
(see Section 12.4 of [16]) with projections, and it attains the 
following regret: 

Theorem 6. Fix a time horizon T and consider the label 
efficient Forecaster run with parameter T) = 1/ VT. Then 



E 



,t=i 



T 

*=i 



T. 



where the expectation is taken with respect to {U t }t- 

Remark 5 (Computational issues involving very small num- 
bers). In some applications, the beliefs p t may be very 
small. (For instance, in the Enron example presented in the 
experimental results, p t — 0(e~ 100 ).) In such a case, we 
will have Pr[U t — 1 1 1/* 1 ] ~ 1, and our anomaly detection 
engine will request feedback almost all the time. To avoid this 
situation, the monotone transformation £ is applied to p t before 
thresholding. For instance, one might consider £(s) = Cs or 
£(s) = C log s for an appropriately chosen positive number C. 
Note that the choice of £ changes the form of the surrogate loss 
function. Thus care must be taken when choosing £ to ensure 
(a) it approximates the original comparator loss as accurately 
as possible, (b) a reasonable number of feedback requests are 
made, and (c) numerical underflow issues are circumvented. 

Proof: Appendix F. ■ 

C. Arbitrary feedback times 

When labels cannot be requested by the Forecaster, but are 
instead provided arbitrarily by the environment or end user, 
we use the following algorithm to choose the threshold r at 
each time t: 

Algorithm 6 Anomaly detection with arbitrarily spaced feed- 
back^ 

Parameters: real number 77 > 0, r min < r max 
Initialize: t\ = r min 
for t = l,2,...,Tdo 

Receive the estimated likelihood p t , set £ t = ((pt) 

if £t < rt then flag z t as an anomaly: y t = 1 else let 

m = -1 

if feedback y t is provided then let Tt+i = H(r t + 
Vytl{y t ^ yt }) else let r t+1 = r t 
end for 



Under arbitrary feedback, it is meaningful to compare the 
performance of the Forecaster against a comparator t only at 
those times when the feedback is provided. We then have the 
following performance bound: 

Theorem 7. Fix a time horizon T and consider the anomaly 
detection with arbitrarily spaced feedback Forecaster run with 
parameter r\ = 1/yT. Let t\, . . . ,t m denote the time steps at 
which the Forecaster receives feedback, and let e = m/T. 
Then, for any r G [t, 



mm ; ' max J 



we have 



i=l i=l 



(25) 



Proof: If we consider only the times {t\, . . . ,t m }, then 
we are exactly in the setting of Theorem 5. This observation 
leads to the bound 

m m 



With the choice r\ 
theorem. 



1/y/T, we get the bound in the 



D. Arbitrary horizon 

So far, we have considered the case when the horizon T 
is known in advance. However, it is not hard to modify the 
proofs of the results of this section to accommodate the case 
when the horizon is not set beforehand. The only change that 
is required is to replace the learning rate 77 = l/VT with 
a time-varying sequence {r) t }, where r) t = 1/yt. Then the 
regret bounds remain the same, namely, O(VT), possibly with 
different constants. 

V. Experimental results 

In this section, we demonstrate how our proposed anomaly 
detection approach FHTAGN performs on simulated and real 
data. We consider four numerical experiments. Experiments 
1, 2a and 2b use simulated data. Experiment 1 tests the 
filtering component of FHTAGN. Experiments 2a and 2b focus 
on the hedging component of FHTAGN. Experiment 3 applies 
FHTAGN to the Enron email database [38]. 

For Experiments 1 and 2, the simulated data are drawn 
from time-varying Bernoulli product densities, which allows 
us to compute empirical regret with respect to the known 
data generation parameters, and to compare it against the 
theoretical regret bound in Theorem 2. We draw i.i.d. ob- 
servations from 500-dimensional Bernoulli product densities: 

x t ~ ni£°i Bemoulli (X*t)> where Pi.t € [0,1]. We use 1 < 
t < 1000, and generate three jumps in /i£ = (ftl t , . . . , (3^ 00 t ) 
at t = 100, 500 and 700. Our observations z t are the noisy 
versions of xt, where each bit Xt(j) is passed through a BSC 
with crossover probability 0.1. At each jump, we define 25 true 
anomalies to emphasize that events are considered anomalous 
relative to temporal windows, i.e., anomalies occur not only 
at the change in probability models, but also at the first few 
instances of the model change before nominal behavior is once 
again restored. 
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Experiment 1. Here we apply Algorithm 3 to the above data. 
We set the learning rate rjt = l/yt. Figure 1(a) illustrates 
the ground truth [i* t vs. the estimated parameter £t t . Figures 
l(c)-(d) show that the log -loss exhibits pronounced spikes at 
the jump times, and then subsides as the Forecaster adapts to 
the new parameters. Note that the variance of the log-loss is 
larger for the noisy case. Finally, Figure 1(e) shows that the 
empirical per-round regret is well below the theoretical bound 
in Theorem 2, with the regret for the noisy case again slightly 
larger. 

Experiment 2a. We now consider the detection of anomalies 
using Algorithm 5. In this experiment, the Forecaster queries 
the end-user for feedback when \((pt) — t 4 | is small. As 
discussed in Remark 5, we use the transformation £(s) = Cs 
with C = cxp(220). 

Experiment 2b. We now test Algorithm 6, which is designed 
for the case when feedback is provided at arbitrary times. 
In particular, in this simulation feedback is always provided 
when the Forecaster declares an event anomalous and with 
20% probability if it misses an anomaly. 

Figure 2 shows the results of both experiments, where the 
declared anomalies are shown as black dots superimposed on 
the log-loss, and the feedback times are depicted underneath. 
For both cases note how, after the first jump (when feedback is 
first received), the Forecaster adapts its threshold, enabling it to 
dramatically increase its detection rate in subsequent jumps. 
Specifically, Table II shows the number of detection misses 
and false alarms for both Algorithms 5 and 6. For comparison, 
we also show the same performance measures for the best 
static threshold chosen in hindsight. In both experiments, 
FHTAGN significantly outperforms static thresholding. The 
number of false alarms is significantly lower than that of 
detection misses due to the high initial value of the threshold 
r t , which is driven lower as feedback arrives. 





Label-efficient 


Arbitral-}' feedback times 






Best static 




Best static 




FHTAGN 


threshold 


FHTAGN 


threshold 


Total Errors 


30 


44 


34 


46 


False alarms 


3 


8 


3 


9 


Detection misses 


27 


36 


31 


37 



TABLE II 

Performance comparison of fhtagn with anomaly detection 
using the best static threshold for experiments 2a and 2b. 
fhtagn commits significantly fewer errors. for the enron 
experiment, feedback was requested for 91 of the 902 days 
considered, and only 523 of the 902 days had their text parsed. 



Experiment 3. Algorithm 5 was applied to the Enron email 
database [38], which consists of approximately 500, 000 e- 
mails involving 151 known employees and more than 75,000 
distinct addresses between the years 1998 and 2002. We use 
email timestamps in order to record users that were active 
in each day, either sending or receiving emails. This was 
done for 1, 177 days, starting from Jan 1, 1999. We removed 
days during which no email correspondence occurred, and 
we consolidated each weekend's emails into the preceding 
Friday's observation vector, resulting in a total of 902 days 
in our dataset. For each day t, x t G {0,1}" is a binary 



vector indicating who sent or received email that day. In 
this setting, we let cf)(x) — x leading to a dimensionality of 
n = d = 75, 511. (There is no noise in this case, since we can 
accurately identify sender and recipient email addresses.) 

Algorithms 3 and 5 were applied to this dataset. The thresh- 
old r was initialized at p\ . We implemented Algorithm 5 using 
C( s ) = Clogs with C = 0.0079 and r\ = 1450. Feedback 
was received when requested according to Algorithm 5. We 
generate oracle or expert feedback based on the email text 
from the difference in word counts between day t and each of 
the previous 10 days, and average the result. Upon feedback 
request at time t, we generate word count vectors h t using the 
12,000 most frequently appearing words (to avoid memory 
issues and misspelled words) for days t — 10, ... , t. The 
error e t is computed as 

1 

Gf = To 51 11^ -Mi 

j=t-10 

where || • 1^ is the l\ norm. When the prediction error e t is 
sufficiently high, we consider day t to be anomalous according 
to our expert. The oracle prediction error e< and the threshold 
determining y t are shown in the right upper plot in Figure 
3, but note that only a fraction of these values need to be 
computed to run FHTAGN whenever feedback is requested. 

The results of Algorithm 5 are summarized in Table III and 
Figure 3. FHTAGN performs very well relative to a comparator 
online anomaly detection method which consists of comparing 
p t to the best static threshold, which was chosen in hindsight 
with full knowledge of all filtering outputs and feedback. The 
left plot in Figure 3 shows the time varying threshold tj in 
response to user feedback. In this experiment, ground truth 
anomalies do not always correspond to large values of — log fit 
but rather to the degree to which contextual evidence differs 
from recent history. With a 10-day memory, the notion of 
what constitutes an anomaly is constantly evolving, and r t 
adjusts to reflect the pattern. The right lower plot describes 
the probability of requesting feedback over time with the days 
on which feedback was requested indicated with black dots. 
This plot suggests that feedback is more likely to be requested 
on days where ((pt) and r t have similar magnitude, which 
is expected. Feedback was requested 91 out of 902 days, and 
because of the sliding window used by our oracle to determine 
the true labels y t , a total of 523 of the 902 days required text 
parsing (and, generally speaking, any overhead associated with 
decrypting, transcribing, or translating documents). 







Best static 




FHTAGN 


threshold 


Total Errors 


73 


143 


False alarms 


35 


96 


Detection misses 


38 


47 



TABLE III 

Performance comparison for fhtagn and the best static 
threshold on enron data set. feedback was requested for 91 of 
the 902 days considered, and only 523 of the 902 days had their 
text parsed. 

Some of the most anomalous events detected by our pro- 
posed approach correspond to historical events, as summarized 
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Fig. 1. (a) Ground truth and (b) Estimated /2 t . The fj, values correspond to Bernoulli means, where lighter colors depict higher probabilities, 
(c) Evolution of the log-loss from noiseless observations, (d) Evolution of the log-loss from noisy observations. The spikes at the jump times 
(t — 100, 500, and 700) correspond to model changes. Note that the variance of the log-loss is larger for the noisy case, (e) Per-round regret 
compared to theoretical upper bound. Again, the regret is larger for the noisy case. 
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Fig. 2. (a) Anomalies detected (shown as black dots) and false alarms (red dots) by Algorithm 5, superimposed on the log-loss. Forecaster 
query times are shown below the log-loss. Jump times (t = 100, 500, and 700) are indicated by green squares, (b) A similar plot of anomalies 
detected by Algorithm 6, with arbitrarily spaced feedback. In both cases, there were 25 true anomalies immediately following each jump. 
After the first jump, the Forecaster adapted its threshold, enabling it to dramatically increase its detection rate in subsequent jumps. 



in Table IV. These examples indicate that the anomalies 
in social network communications detected by FHTAGN are 
indicative of anomalous events of interest to the social network 
members. 

VI. Conclusion 

We have proposed and analyzed a methodology for se- 
quential (or online) anomaly detection from an individual 
sequence of potentially noisy observations in the setting when 
the anomaly detection engine can receive external feedback 



confirming or disputing the engine's inference on whether 
or not the current observation is anomalous relative to the 
past. Our methodology, dubbed FHTAGN for Filtering and 
Hedging for Time-varying Anomaly recoGNition, is based 
on the filtering of noisy observations to estimate the belief 
about the next clean observation, followed by a threshold test. 
The threshold is dynamically adjusted, whenever feedback is 
received and the engine has made an error, which constitutes 
the hedging step. Our analysis of the performance of FHTAGN 
was carried out in the individual sequence framework, where 
no assumptions were made on the mechanism underlying 
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Fig. 3. Online anomaly detection results on Enron corpus. Left plot displays filtering output, locations of missed anomalies (as declared by 
our oracle), false positives, and correctly identified anomalies, as well as time-varying threshold r. Upper right plot displays the probability 
of requesting feedback where black circles indicate the locations where feedback was provided. Lower right plot displays oracle prediction 
error e t (from contextual evidence within a sliding window) compared to a static threshold to assign y t . 



Date 


Significance 


Dec. 1, 2000 


Days before "California faces unprecedented energy 
alert" (Dec. 7) and energy commodity trading dereg- 
ulated in Congress. (Dec. 15) [39]. 


May 9, 2001 


"California Utility Says Prices of Gas Were Inflated" 
by Enron collaborator El Paso [40], blackouts affect 
upwards of 167,000 Enron customers [41]. 


Oct. 18, 2001 


Enron reports $618M third quarter loss, followed by 
later major correction [42]. 



TABLE IV 

Significant dates in Enron's history and our analysis. 



the evolving observations. Thus, performance was measured 
in terms of regret against the best offline (nonsequential) 
method for assigning beliefs to the entire sequence of clean 
observations and then using these beliefs and the feedback 
(whenever available) to set the best critical threshold. The 
design and analysis of both filtering and hedging was inspired 
by recent developments in online convex programming. 

One major drawback of the proposed filtering step is the 
need to compute the log partition function. While closed-form 
expressions are available for many frequently used models 
(such as Gaussian MRFs), computing log partitions of general 
pairwise Markov random fields is intractable [23]. While there 
exist a variety of techniques for approximate computation of 
log partition functions, such as the log determinant relaxation 
[43], [44], these techniques themselves involve solving convex 
programs. This may not be an issue in the offline (batch) 
setting; however, in a sequential setting, computations may 
have to be performed in real time. Therefore, an important 
direction for future research is to find ways to avoid computing 
(or approximating) log partition functions in the filtering step, 
perhaps by replacing the full likelihood with an appropriate 
"pseudo-likelihood" [45], [46]. 



Appendix 

A. Proof of Lemma 1 

This lemma is a standard ingredient in virtually all analyses 
of online prediction algorithms using mirror descent (see, 
e.g., the proof of Theorem 11.4 in [16]). Nevertheless, since we 
use the structure of mirror descent rather heavily in subsequent 
proofs, we provide the full proof here. 

By the convexity of f t , we can write 

ft(ut) < ft(u) - (gt(u t ),u-u t )- (26) 

Now, using the definition of the dual update and the fact that 

£t+i = VF(u t +i), we have 

9t(u t ) = - l+i) = -(Vf(St) - VF(5t +1 )). 

vt vt 

Substituting this into (26) and using (3), we obtain 

ft{u t ) < ft(u) + -hF(u t+1 ) - VF(«t),u - u t ) 
Vt x ' 

= ft(u) H (D F (u,u t ) + D F (u t ,u t+ i) 

Vt V 

- D F {u,Ut +1 j). (27) 

Now, since u £ S and ut+i = tlF,s(ut+i), we can use me 
generalized Pythagorean inequality (4) to obtain 

D F (u,u t+1 ) > D F (u,u t+ i) + D F (u t+1 ,u t +i) 
> D F (u,u t+ i) 

(the second inequality is due to the fact that D F (-,-) is always 
nonnegative). Using this in (27), we get (5). 
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B. Proof of Theorem 1 



and (31) and rearranging yields 



The proof combines elements of the standard analysis of 
mirror descent (see, e.g., [7], [17]) with arguments from OCP 
literature [13]. In order to apply these techniques, we first 
make a crucial observation pertaining to strong convexity of 
the log-likelihood function in an exponential family. 

For each t, let us use the shorthand £t(0) to denote the 
filtering loss 1(8, Zt), 8 G A. We start by observing that, for 
any 0, 0' G we have 



l t (8) - \£ t (8>) + (V£ t (8 , ),8- 

= -(8, h(z t )) + $(8) - [ - (8 1 , h(zt)) + $(0') 
- (h(z t ),8 - 8') + (V<Z>(8'),8 - 8')] 
= (9' -6,h(zt)) + (6-6',h(z t )) 

v ' 

=0 

+ $(<?) - $(0') - (V$(0'), - 8') 
= D{8'\\8). (28) 

In the terminology of [13], (28) means that the function 
8 i-> £t(8) is strongly convex w.r.t. the Bregman divergence 
D$(0, 0') ee £>(0'||0) with constant 1. In fact, their condition 
for strong convexity holds here with equality. Moreover, for 
any 0, 0' G A we will have 



D(0'||0) = $(0) - $(0') - ( V$(0'), - 0' 



= -^0-0',V 2 $(0")(0-0') 
>#II0-0'II 2 , 



(29) 



where 0" is some point on the line segment joining and 0'. 
Now let us linearize £t(9) around 9 t by defining 



V0 G A. 



£t(9)^£ t (9 t ) + (V£ t (8 t ),8-8 t 
Note that I t (8 t ) = £ t (8 t ) and 

£t(8 t )-£t(8) = -(v£ t (9 t ) 



Hence, substituting 0' = 9 t into (28) and applying (29), we 
obtain 

£t{9t)-£t{0) = it@t)-lt{0)-D(9 t \\9). (30) 

Next, we note that if we were to replace £ t (-) with £ t (-) 
throughout Algorithm 3, then the updates in the algorithm 
would not change. Therefore, applying Lemma 1, we can write 

£t(O t )-£t(0) 

< - (D{9 t \\9) - D{9 t+1 \\8) + D(8 t+1 \\8 t )) , (31) 



£t(0t)-£t(0) 

< - (D(8 t \\8) ~ D(8 t+1 \\8)) - D(8 t \\8) + -D(8 t+1 \\8 t ) 

--l) D(8 t \\8) - -D(9 t+1 \\8) + -D(8 t+1 \\8 t ) 
Jit J Vt Vt 



= l/Vt-l 

A t -A t+1 + -D(8 t+1 \\8 t ) 
Vt 



(32) 



where in the second line we have used the definition r) t — l/t, 
t > 1, and I/7/0 = 0, and in the third line we have defined 
A t = (l/j7 t _i)D(0i||0). We now bound the third term on the 
RHS of (32). To that end, we use a small trick (see, e.g., 
Section 4 in the paper of [13]; this argument, however, is 
standard in the theory of mirror descent [7], [17]). First, 

D(0 t+1 ||0 t )+ J D(0 t ||0; +1 ) 

= (v$(0t + i)-V$(0t),0 t +i -Ot 



i?t+i ~ Ht,Vt+l - 

= -r]t(v£ t (8 t ),8 t+1 
2 



< 



AH 



H 



n+i 



where the first line is due to (7), the second line uses the 
primal-dual relations fb' t , j = V$(0t+i) an d Mt = V$(0t), 
the third line uses the definition of the dual update, and the 
final line is an application of the inequality 



1 



1 



to the vectors u = -T) t V %(8 t ) / 'V%H and v = \f2H(9 t+1 
9t). The goal of this exercise is to be able to write 



£>(0t+i||0t) < 



Vt 
AH 



W t (0 t 
H " 



t+l — v t 



Dm 



<0 



where we have used (29). Moreover, 

V? t (0 t ) <\\h{z t )\\+ V$(0 t ) < 2(K(z T )+M). 



Hence, 



Vt tl 



(33) 



where t+ i = V$*(/2J +1 ) in Algorithm 3. Combining (30) Substituting (33) into (32) and summing from t = 1 to t = T, 
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we obtain 

T T 



t—i t—i 

< E ( A * - A *+i) + — - E 



t=i 



t=i 



<^(^^) + ( ^ +M K o g (T + l), 

where in the last line we have used the estimate 2~2t=i ^ 1 < 
1 + £t~ l dt = logT+1. 



C. Proof of Theorem 2 

The main idea of the proof is similar to that in Theorem 1, 
except that now care must be taken in dealing with the time- 
varying comparison strategy 9 = {6 t }. Applying Lemma 1 to 
^t(-), we write 



l(0 t ) < l(0t)+- ( D(0 t \\6 t ) - D(0 t+1 \\6 t ) + D(9 t+1 \\9 t ) 
Vt 



Adding and subtracting D(8 t +i ||^t+i) inside the parentheses 
and rearranging, we get 

%{6t)-%{0 t ) 

< - (D(6 t \\6 t ) - D(6 t+1 \\e t+1 ) + D(9 t+1 \\9 t+1 ) 

Vt V 



D{6 t+1 \\6 t ) + D{6 t+1 \\6 t 



= A* - A t+1 + ( — - -) D(9 t +i\\0t+i) 
\Vt+i Vt) 

+ -T t + -D(e t+1 \\e t ) 

Vt Vt 

< A t - A t+1 + (— - -) D + -T t + -D(9 t+1 \\9 t ), 
\Vt+i Vt) Vt Vt 

where we have defined A t = (l/r) t )D(6 t \\0t) and T t — 
D(9 t+1 \\9 t+1 ) - D(6 t +i\\9t). Next, we have 



r t = $(0 t+1 ) - $(0 t+1 ) - (V$(6 t +i),0t+i - Ot+i 
- $(0 t ) + $(0 t+1 ) + (V$(d t+1 ),6 t - 9t+i 
= $(0 t+ i) - §(6 t ) + (V*(? t +i), t - t +i) 

< am \\e t - e t+1 \\ . 

Moreover, just as in the proof of Theorem 1, we have 



Vt H 



Combining everything and summing from t = 1 to t = T, we 
obtain 

T T 

t=i *=i 

T T 

<Y / (A t -A t+1 )+Y / (—--)D 
t"f t^\Vt+i Vt) 



i 

+ uiy2-\\e t -9 t+1 \ 



(K(z T )+M) 



t=l 

< Ai - A T+1 



H 



2 t 

E* 



t=i 



i 



i 



Vt+i Vi 

T 



D 



T 

■E* 



+ ^M VT{e)+ W^My 

Vt H 
< D^p!) + dVt + 1 

+ 4MVTV t (9) + {K{zT) + M) \ 2VT - 1). 
H 

In the last line, we have used the estimate Y^t=i t — 
1 + f[t- 1 / 2 dt = 2VT-l. 

D. Proof of Lemma 2 
For each t, we have 



E [A e , i+1 (a; 
' t+i 



t+i z t+i 



)\n t ] 



E 



E[(9 t+1 ,h(z t+1 ) - (f>(x t+1 ))\K t } 



■ E 



E<*., 



= (0i +ll E/i(z t+1 ) - 0(ar t+ i)) + E<*«» ~ 0(*«)> 

8=1 

= + A< M ( 2 ; t ,z*) 

where in the third step we used the fact that 9t+i, 
{9s}s<t, and {-z s } s <t are IZt -measurable, and in the last 
step we used the fact that E[h(z t +i)\TZt] — (f>(xt+i). Thus, 
{A e)t (^,2*),^ t } t > , with Ae, Q (x°,z°) = and K the 
trivial <7-algebra, is a zero-mean martingale, and the desired 
result follows. 

E. Proof of Theorem 5 

We closely follow the proof of Theorem 12.1 in [16]. Let 
i' t (r) denote the subgradient of r H ^t(r). Note that when 
it(j) > 0, l' t (r) = —y t . Thus, when y t ^ yt, the Forecaster 
implements the projected subgradient update r t +\ = n(r t — 
i]£' t (T t )). Define the unprotected update r t+ i = Tt — V^'t{ T t)- 
Whenever y t ^ y t , we may use Lemma 1: 



r,(e t (r t )-£ t (r)) 
< v{t - T t)Vt 



< 



Tt+l) 



(T t - T t+ l) 



(34) 
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where the inequalities hold for every r. Now, at any step at 
which y t ^ y t , i.e., sgn(r t - Ct) Vu the hinge loss l t {r) = 
(1 — (r — Ct]Vt)+ obeys the bound 



l-£ t (r) = 1-(1- (r-Ct)yt)- 
<(t- Ct)Vt 

Therefore, when yt yt, we have 

ri(l-it(r)) 

< v( T - T t)yt + vin - Ct)yt 



(35) 




T t y-(T-T t+1 y + (Tt-T t+1 y 



(36) 



<~ 2 [(r- 

where we used the fact that y t ^ y t implies that sgn(r t — 
Ct) Vu so that (r t — Ct)yt < 0. Note also that when 
y t = y t , we will have r t+1 = r t , and since r t € [T min ,r max ], 
r t+ i = n(r f+ i) = r f . Thus, the very last expression in (36) 
is identically zero when y t — y t . Hence, we get the bound 

?7(l-4(V)) 1 {S't#3/t} 

< \ [(r ~ T t f (r r t+1 ) 2 + (r t - r t+1 ) 2 } (37) 

that holds for all t. Summing from t = 1 to t = T and 
rearranging, we get 

T 

(r - n) 2 + J2( T t ~ Tt+i) 



T 

t=i t=i 



i 

277 



1 

2^ 



l + ^(r t -? t+1 ) 2 



Now, since r t +\ = T t + r]y f l{y t ^y t }, we can bound this further 

by 

T T 1 

Choosing ?; = 1/vT, we obtain the regret bound (24). 

F. Proof of Theorem 6 

We closely follow the proof of Theorem 12.5 in [16]. In- 
troduce the random variables M t — l{y t ^ yt }- Then, whenever 
M t = 1, we have ?7(1-^(t)) < fj(r-Ct)|fe. When M t Ut = 1 
(i.e., when r t is updated to r t+ i), we can use Lemma 1 and 
obtain 

< v(n - Ct)vt + v( T - n)yt 

< V(n - (t)yt + ^ [(t - T t f - (r - r t+1 f + {r t - r t+ i) 2 ] 
From this, we obtain the inequality 

(1 + \p t - T t \)M t U t 

< W + ^ - - (t - r 4+1 ) 2 + (r t - n +1 ) 2 ] , 

which holds for all t. Indeed, if M t Ut = 0, the left-hand 
side is zero, while the right-hand side is greater than zero 



since Itij) > and r t = r t+ i = Tt+i- If M t Ut = 1, then 
Vtin - Ct) = ~(n - Ct) sgn(r t - Ct) = — 1C* — T tl Summing 
over t, we get 

T T 

J2(l + \Ct-T t \)M t U t <J2tt(r) 
t=i t=i 

+ r t ) 2 - (r - r t+1 ) 2 + (r t - r t+l f] . 

^ t=i 

We now take expectation of both sides. Let IZt denote the 
(T-algebra generated by Ui, . . . ,U t , and let E f [-] denote the 
conditional expectation E[-|7£. t _i]. Note that M t and \( t — T t \ 
are measurable w.r.t. IZt-i, since both of them depend on 
U u . . . , U t -i, and that E t U t = 1/(1 + \( t ~ n\). Hence 



E 



Y,(i + Kt-n\)M t u t 



,t=i 



E 



E 



^(l + |Ct-r t |)M t E t C7 t 
t=i 

T 

E^ 



Using the same argument as before with i] = 1/VT, we obtain 



E 



E%^} 



<E^( t ) + ^ 



and the theorem is proved. 
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